In this document I describe how various factors affect the attrition rate at IBM and identify which of them explain attrition, so that attrition can be reduced by addressing them.
IBM provides a pseudo-dataset showing key factors from which we can see patterns in employee performance and attrition. This notebook focuses on finding the factors that directly affect the attrition rate.
I will approach the analysis with some common questions in mind, such as:
Does a lack of promotion lead to attrition?
Does low income lead to attrition?
Does the person change jobs frequently?
and many more.
Let's start by summarizing the dataset:
## 'data.frame': 1470 obs. of 35 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
Dropping StandardHours and EmployeeCount from the data frame, as they are unnecessary for our analysis.
df$StandardHours <- NULL
df$EmployeeCount <- NULL
Checking whether the dataset has any missing values by checking its dimensions:
## [1] 1470 33
Since the dimensions match the values in the summary (1470 rows, and 33 columns after dropping two), there are no missing values, so we can continue with our exploration.
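Matching dimensions do not by themselves rule out NA values inside individual cells, so a more direct check can be useful. The following is a minimal sketch in base R; the data frame `toy` is a made-up stand-in for `df`, with NAs inserted on purpose to show what each check reports.

```r
# Toy data frame standing in for the real `df` (values invented for illustration).
toy <- data.frame(
  Age = c(41L, 49L, NA, 33L),
  Attrition = factor(c("Yes", "No", "Yes", NA), levels = c("No", "Yes"))
)

# Count missing values per column; a column without NAs reports 0.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Single TRUE/FALSE answer for the whole data frame.
has_missing <- any(is.na(toy))

# complete.cases() flags rows without any NA, so this is the number of
# rows that would survive dropping incomplete rows.
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

On the real data, `colSums(is.na(df))` returning all zeros would confirm the conclusion above.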
Let's explore the demographics of our dataset: Age, Gender, Attrition and Distance From Home.
Distribution of Age - Age is approximately normally distributed, with a mean of about 37, a minimum of 18 and a maximum of 60 (retirement). This suggests there are no unusual values in the dataset.
Let's summarise this feature to explain the findings:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 30.00 36.00 36.92 43.00 60.00
Distribution of Gender - There are fewer female employees (588) than male employees (882), which is not unusual in a workplace. We need to keep this imbalance in mind when comparing anything by gender.
Let's summarise this feature to explain the findings:
## Female Male
## 588 882
Distribution of Attrition - Attrition is low (237 of 1470 employees, about 16%), which is expected for a well-established company.
Let's summarise this feature to explain the findings:
## No Yes
## 1233 237
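The counts above can be turned into shares with `prop.table()`. This sketch rebuilds the Attrition column from the two reported counts only (1233 and 237), so it does not need the original file; the variable names are illustrative.

```r
# Rebuild the Attrition column from the counts reported above.
attrition <- factor(rep(c("No", "Yes"), times = c(1233, 237)),
                    levels = c("No", "Yes"))

counts <- table(attrition)

# Convert counts to percentages, rounded to one decimal place.
shares <- round(100 * prop.table(counts), 1)
print(shares)
```

This gives 83.9% for "No" and 16.1% for "Yes", so roughly one employee in six in the dataset has left.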
Distribution of Distance From Home - This data is interesting because it follows the common pattern of living near the office. The interesting part will be whether anyone leaves the company because of this factor.
Let's summarise this feature to explain the findings:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 7.000 9.193 14.000 29.000
Additionally, let's explore some more features:
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Distribution of Travel - The graph suggests that most employees are in the Travel_Rarely category. This will be a good feature for examining the attrition rate.
Let's summarise this feature to explain the findings:
## Non-Travel Travel_Frequently Travel_Rarely
## 150 277 1043
Distribution of Job Level - The dataset contains more people in lower job levels, which is normal, so few people are present in the higher job levels.
Let's summarise this feature to explain the findings:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.064 3.000 5.000
Distribution of Experience - As with job level, there are more employees with low experience, for the same reason explained above.
Let's summarise this feature to explain the findings:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 6.00 10.00 11.28 15.00 40.00
Distribution of Income - Income is also related to age, experience and job level, so the same pattern can be observed here.
Let's summarise this feature to explain the findings:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1009 2911 4919 6503 8379 19999
All of these relationships can be explored with a correlation plot, which I will do in the bivariate plot section.
The dataset includes employee details, with some factor variables such as Attrition, BusinessTravel and Department. The others are integers, such as DistanceFromHome and MonthlyIncome. There are no other data types and no missing values.
The variables most likely to help find the factors affecting attrition are, first and foremost, Attrition itself, followed by Age, Business Travel, Distance From Home, Monthly Income, Overtime and Work-Life Balance.
Features such as years since last promotion, number of companies worked at, Gender and Marital Status will help us explore in more depth.
Before starting the bivariate plots, let's compare the correlations between the variables so that we can avoid unnecessary comparisons between related variables and focus on the important ones.
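As a rough sketch of how such a correlation comparison could be computed in base R (not necessarily how the plot in this notebook was produced), the block below builds a correlation matrix over the numeric columns and lists the strongly correlated pairs. The `toy` data frame and the 0.7 threshold are assumptions made purely for illustration.

```r
# Toy stand-in for `df`; the numbers are invented to make the sketch runnable.
toy <- data.frame(
  Age = c(25, 32, 41, 48, 55, 29),
  MonthlyIncome = c(2500, 4200, 6100, 9800, 15000, 3100),
  TotalWorkingYears = c(2, 8, 15, 22, 30, 5),
  DistanceFromHome = c(10, 3, 24, 1, 7, 15),
  Attrition = factor(c("Yes", "No", "No", "No", "No", "Yes"))
)

# Keep only numeric columns, since cor() does not accept factors.
numeric_cols <- toy[, sapply(toy, is.numeric), drop = FALSE]

# Pearson correlation matrix, rounded for readability.
corr <- round(cor(numeric_cols), 2)
print(corr)

# List each pair once (upper triangle) whose absolute correlation
# exceeds the threshold; 0.7 is an arbitrary cut-off.
strong <- which(abs(corr) > 0.7 & upper.tri(corr), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(corr)[strong[, "row"]],
  var2 = colnames(corr)[strong[, "col"]],
  r = corr[strong]
)
print(strong_pairs)
```

Listing the pairs in a table makes it easier to decide which variables can stand in for each other in the later plots.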
From the correlation graph above, we can deduce that the following variables are strongly correlated (dark blue represents high correlation):
1. Age ~ JobLevel ~ MonthlyIncome ~ TotalWorkingYears ~ YearsAtCompany -
This makes sense: as age increases, job level, monthly income, total working years and years at company also increase. Let's use scatter plots of all these variables to look in detail.
As the graph shows, all of these features increase as age increases.
2. JobLevel ~ YearsAtCompany ~ TotalWorkingYears
This is an interesting relationship: as job level increases, people tend to remain in the same company. Let's look at a scatter plot.
One strange finding here is that job level depends less on years at company than on total working years. People who have spent the same time at the company hold different job levels, whereas this is not true for total experience.
3. PercentSalaryHike ~ PerformanceRating
A better performance rating brings a larger hike; this relationship can be seen here, as a higher rating gives a higher hike.
Since the dataset contains only ratings of 3 and 4, we can conclude that people with a rating of 3 get a hike of less than 20% and people with a rating of 4 get a hike of more than 20%.
4. YearsAtCompany ~ YearsInCurrentRole ~ YearsSinceLastPromotion ~ YearsWithCurrManager
This shows some interesting relationships, such as the one between years at company and promotion. As experience at the same company increases, promotions happen less often than one would expect and roles also do not change frequently. These two factors contribute to a longer time with the current manager.
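The claim in item 3 about hike ranges per rating can be checked without a plot by computing the minimum and maximum hike within each rating. This is a minimal sketch; the `toy` values are invented and only mimic the shape of the two columns.

```r
# Toy stand-in for the two columns of `df` (values invented for illustration).
toy <- data.frame(
  PerformanceRating = c(3L, 3L, 3L, 4L, 4L, 4L),
  PercentSalaryHike = c(11L, 15L, 19L, 20L, 22L, 25L)
)

# Split the hikes by rating and take the range of each group.
# The result is a matrix with one column per rating.
hike_range <- sapply(split(toy$PercentSalaryHike, toy$PerformanceRating), range)
rownames(hike_range) <- c("min", "max")
print(hike_range)
```

If the two ranges do not overlap on the real data, the rating is effectively determined by the hike, which explains the strong correlation.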
Now that we have explored the relationships within the dataset, we can turn to the question we posed: which factors does attrition depend on?
As attrition has no direct relationship with the other variables, let's look at the factors one by one:
Age is one of the major factors that attrition depends on. Young people tend to switch jobs more than older people. Let's explore the pattern. From here on, I will produce plots as proportions: for each value of a variable, the plot shows the share of employees with and without attrition.
Plotting a histogram of Age will answer most of the questions, since age is related to numerous factors in the dataset.
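The proportions behind such a plot can be sketched in base R as follows. The `toy` data and the 10-year bin width are assumptions for illustration; the notebook's own plots may use different bins.

```r
# Toy stand-in for `df` (values invented for illustration).
toy <- data.frame(
  Age = c(19, 22, 24, 26, 28, 31, 35, 38, 42, 47, 53, 58),
  Attrition = factor(
    c("Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No", "No", "Yes"),
    levels = c("No", "Yes")
  )
)

# Bin Age into 10-year groups: [18,28), [28,38), [38,48), [48,58), [58,68).
toy$AgeGroup <- cut(toy$Age, breaks = seq(18, 68, by = 10), right = FALSE)

# margin = 1 normalises each row, so every age group sums to 1 and the
# "Yes" column is the attrition proportion within that group.
prop_by_age <- prop.table(table(toy$AgeGroup, toy$Attrition), margin = 1)
print(round(prop_by_age, 2))
```

Working with within-group proportions rather than raw counts keeps large groups from dominating the picture; passing `t(prop_by_age)` to `barplot()` would draw the corresponding stacked bars.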
From the graph above, it is clear that people who joined the company at a young age tend to leave, either for higher education or for another company. Attrition declines with age and starts increasing again after 50, due to factors such as retirement.
Since age and total working years are related, we see the same pattern here. Attrition is high at the start of a career and declines afterwards. Near retirement, at around 40 total working years, the attrition rate rises sharply. The same goes for JobLevel and Monthly Income.
Let's plot some more variables against attrition.
As expected, people who have to travel frequently have higher attrition.
The same trend appears in years at company: attrition is higher initially, and after 23 years the attrition rate begins to increase again, which may be due to people retiring.
Years in current role, which may change after a promotion, usually shows either that people leave just after getting promoted or that they leave because they are not getting a new role. However, the pattern here does not offer a clear explanation.
Option Level and Percent Salary Hike
Popular opinion suggests that years since last promotion matters for attrition, but based on our data it is hard to deduce that not getting promoted is a reason for attrition. Also, IBM's employee hierarchy differs from that of other companies: it has fewer job profiles and hence fewer promotions, but pay increases over time.
Work-life balance is a factor in attrition, as people reporting low work-life balance have very high attrition compared to others. The same goes for stock option level: employees at lower levels leave, and attrition at the higher levels may be due to people retiring.
Percent salary hike suggests that people who get a larger hike leave after the hike. This is plausible, as other companies will try to match the salary from the previous job. However, the difference is not large enough to conclude this.
It is clear that overtime has an effect on attrition, as people working overtime tend to leave the company.
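Comparisons like the overtime one come down to the attrition rate within each level of a factor. A small helper makes that repeatable for any categorical column; `attrition_rate_by` is a hypothetical name introduced here, and the `toy` data is invented.

```r
# Hypothetical helper: share of employees with Attrition == "Yes"
# within each level of the column named by `var`.
attrition_rate_by <- function(data, var) {
  tapply(data$Attrition == "Yes", data[[var]], mean)
}

# Toy stand-in for `df` (values invented for illustration).
toy <- data.frame(
  OverTime = factor(c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No")),
  Attrition = factor(
    c("Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No"),
    levels = c("No", "Yes")
  )
)

rates <- attrition_rate_by(toy, "OverTime")
print(round(rates, 3))
```

The same call with `"BusinessTravel"`, `"MaritalStatus"` or `"Department"` would give the rates discussed in the surrounding paragraphs, assuming those columns are factors as in the summary at the top.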
Interestingly, people who have worked at more than 5 companies have more attrition than others. However, the pattern is not clear, so we cannot draw a conclusion.
As expected, people whose MaritalStatus is Single have more attrition; after getting married, leaving a job is much rarer because of the need to support a family.
Job satisfaction is also a factor, as people with low job satisfaction leave the job.
Job involvement and environment satisfaction also directly affect attrition.
Employees with HR, Marketing and Technical Degree education fields have more attrition than those from any other field.
Attrition does not depend much on Education, but we can say that people with the highest level of education have low attrition.
Distance from home may not look like a factor in attrition, but people living farther from the office do have more attrition.
It may look as if department affects attrition, but there is no clear pattern to support that conclusion.
The most interesting feature was age. It is directly related to other features, and exploring age alone answered most of the questions: employees with lower income, lower job level and new joiners have more attrition than others.
Another feature worth looking at was Overtime. The attrition rate for employees working overtime was much higher than for other employees.
The relationship between years at company and years since last promotion was interesting: with more years at the company, promotions did not happen as often as expected, which is contrary to popular belief. Roles also did not change.
The strongest relationship was between Age and Monthly Income. As age increases, an individual's income increases.
The idea of this multivariate plot was to see whether age and travel together are related to attrition. Let's look at the plot.
Attrition in the non-travel job type is very low overall, and above age 40 it is almost nil, while the other travel types still see some attrition. Another conclusion is that for Travel_Frequently, attrition below age 25 is almost 100%.
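A rate close to 0% or 100% in one cell can come from very few employees, so it helps to look at the group sizes next to the rates. The sketch below cross-tabulates age group and travel type in base R; the `toy` data and the age breaks are assumptions for illustration.

```r
# Toy stand-in for `df` (values invented for illustration).
toy <- data.frame(
  Age = c(22, 24, 23, 21, 35, 30, 45, 50, 42, 44),
  BusinessTravel = factor(c(
    "Travel_Frequently", "Travel_Frequently", "Travel_Rarely", "Travel_Rarely",
    "Travel_Rarely", "Non-Travel", "Non-Travel", "Non-Travel",
    "Travel_Frequently", "Travel_Rarely"
  )),
  Attrition = factor(
    c("Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No", "No", "No"),
    levels = c("No", "Yes")
  )
)

# Three age bands; include.lowest = TRUE keeps age 60 in the last band.
toy$AgeGroup <- cut(toy$Age, breaks = c(18, 25, 40, 60),
                    right = FALSE, include.lowest = TRUE)

# Attrition rate per (age group, travel type) cell; empty cells are NA.
rate <- tapply(toy$Attrition == "Yes",
               list(toy$AgeGroup, toy$BusinessTravel), mean)

# Number of employees behind each rate.
n <- table(toy$AgeGroup, toy$BusinessTravel)

print(round(rate, 2))
print(n)
```

Reading `rate` and `n` side by side shows whether an extreme rate rests on a handful of employees or on a sizeable group.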
Next, I wanted to see whether people leave the job when they are married and travel is required. Let's look at the plot.
Looking at the plot, there is no clear pattern visible here, so this combination does not appear to affect attrition.
Distance and age together could be a factor in leaving a job, as people get tired of travelling; having a family could add to this. Let's look at the plot.
Here too there is no clear pattern, so the assumption is not supported.
A gender-based analysis can show whether gender is a factor in attrition due to travel.
Here too we can see that gender is not a big factor, as there is no clear pattern.
Adding one more variable, Marital Status, to the above plot, let's explore.
Here we can see a pattern: married people have more attrition at larger distances compared to single and divorced people.
One definitive relationship was that Business Travel plays a role in attrition: Travel_Frequently has more attrition below age 25, and non-travel jobs have almost no attrition after age 40, while Travel_Rarely has no visible effect on attrition.
Another relationship was that married people have more attrition than others when the distance from home is large.
A surprising finding is that Gender plays no role in attrition when combined with any other feature in the dataset, although popular opinion suggests otherwise.
The reason for including this plot in the summary was the pattern of experience increasing up to 10 years and declining after that, which suggests that most employees start retiring after working 10 years, with 40 years being the maximum, which corresponds to retirement age.
Age is a feature related to several other features, so we were able to deduce directly that features such as Monthly Income and years at company affect attrition in the same pattern as age.
Although the numbers of data points were not equal, this plot showed that in non-travel jobs people older than 40 had more attrition. Similarly, people travelling frequently had a lot of attrition below age 30 (almost 80%).
The whole analysis was aimed at finding out which factors affect attrition. Based on the analysis, the following factors came up -
One problem with the dataset is that it does not give the reason why attrition happened. Including it would make the analysis more interesting. Also, the dataset is not large enough to support some conclusions.
Surprisingly, attrition happens for the same reasons as commonly perceived. However, the finding that attrition is not related to promotion was a notable result of the analysis.
Future work could build a model that shows in detail how to reduce attrition by addressing some of these factors.